1. Gaussian Model Predictions

A model $m_1$ for observations $x \in \mathbb{R}$ is specified as follows:

$$p(x|\mu, m_1) = \mathcal{N}(x|\mu, 1), \qquad p(\mu|m_1) = \mathcal{N}(\mu|0, 1).$$

(Consult the Formula sheet in the preamble of the exam for Gaussian distribution rules).

1a. We make an observation $x=1$. Determine the posterior $p(\mu|x=1, m_1)$. - a) $\mathcal{N}(\mu|0, 0.5)$ - b) $\mathcal{N}(\mu|1, 2)$ - c) $\mathcal{N}(\mu|0.5, 0.5)$ - d) $\mathcal{N}(\mu|0.5, 1)$

1b. Determine the evidence $p(x=1|m_1)$ for model $m_1$, based on observation $x=1$. - a) $\mathcal{N}(1|0, 2)$ - b) $2/\sqrt{2\pi}$ - c) $\mathcal{N}(0|1, 1)$ - d) $1/\sqrt{2\pi}$
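A conjugate-Gaussian update of this kind can be sketched numerically. The prior and noise parameters below (a standard-normal prior on $\mu$, unit observation noise) are illustrative assumptions, not a definitive reading of the model specification:

```python
from math import sqrt, pi, exp

def gaussian_pdf(x, mean, var):
    """Density of N(x | mean, var)."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def conjugate_update(prior_mean, prior_var, noise_var, x):
    """Posterior over mu and evidence for p(x|mu) = N(x|mu, noise_var)
    with prior p(mu) = N(mu | prior_mean, prior_var)."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    post_mean = post_var * (prior_mean / prior_var + x / noise_var)
    # Evidence: marginally, x ~ N(prior_mean, prior_var + noise_var).
    evidence = gaussian_pdf(x, prior_mean, prior_var + noise_var)
    return post_mean, post_var, evidence

# Hypothetical parameters: N(0, 1) prior on mu, unit noise, observation x = 1.
mu_post, var_post, ev = conjugate_update(0.0, 1.0, 1.0, 1.0)
```

With these assumed parameters the posterior comes out as $\mathcal{N}(\mu|0.5, 0.5)$ and the evidence as $\mathcal{N}(1|0, 2)$, illustrating the general precision-weighting pattern behind questions 1a and 1b.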

Consider a second model $m_2$, specified as $p(x|m_2) = \mathcal{N}(x|1, 1)$. The model priors are given by $p(m_1) = 2/3$ and $p(m_2) = 1/3$.

1c. Determine the probability $p(x=2)$ by Bayesian model averaging over both $m_1$ and $m_2$. - a) $\frac{2}{3\sqrt{2\pi}} + \frac{1}{3}\mathcal{N}(2|0, 1)$ - b) $\frac{1}{3}\mathcal{N}(2|1, 2) + \frac{1}{3\sqrt{2\pi}}$ - c) $\frac{2}{3}\mathcal{N}(2|0, 2) + \frac{1}{3}\mathcal{N}(2|1, 1)$ - d) $\frac{1}{3\sqrt{2\pi}} + \frac{1}{3}\mathcal{N}(2|1, 1)$
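Bayesian model averaging is prior-weighted mixing of the per-model predictive densities, $p(x) = \sum_m p(m)\,p(x|m)$. A minimal sketch; the two predictive densities below are hypothetical choices for illustration:

```python
from math import sqrt, pi, exp

def gaussian_pdf(x, mean, var):
    """Density of N(x | mean, var)."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def model_average(x, models):
    """Bayesian model averaging: p(x) = sum_m p(m) * p(x|m)."""
    return sum(p_m * predictive(x) for p_m, predictive in models)

# Hypothetical predictive densities: p(x|m1) = N(x|0, 2) with p(m1) = 2/3,
# and p(x|m2) = N(x|1, 1) with p(m2) = 1/3.
models = [
    (2 / 3, lambda x: gaussian_pdf(x, 0.0, 2.0)),
    (1 / 3, lambda x: gaussian_pdf(x, 1.0, 1.0)),
]
p_x2 = model_average(2.0, models)
```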


2. Classification

You have a machine that measures property $x$, the "transparency" of oils. You wish to discriminate between $C_1 = $ 'olive oil' and $C_2 = $ 'grape seed oil'. It is known that the class-conditional densities are

$$p(x|C_1) = 4 - 2x, \qquad p(x|C_2) = 6(1-x)(x-2), \qquad \text{for } x \in [1.0, 2.0].$$

The probability that $x$ falls outside the interval $[1.0, 2.0]$ is zero. The prior class probabilities $p(C_1) = 0.4$ and $p(C_2) = 0.6$ are also known from experience.

2a. Compute $p(C_1|x=4/3)$. - a) $2/3$ - b) $3/4$ - c) $3/5$ - d) $4/10$
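The class-conditional densities that appear inside the integrands of question 2c ($p(x|C_1) = 4-2x$ and $p(x|C_2) = 6(1-x)(x-2)$ on $[1, 2]$, each of which integrates to 1) can be plugged into Bayes' rule to check a computation like 2a numerically:

```python
def p_x_given_c1(x):
    return 4 - 2 * x               # density on [1, 2], integrates to 1

def p_x_given_c2(x):
    return 6 * (1 - x) * (x - 2)   # equals 6(x-1)(2-x) >= 0 on [1, 2]

def posterior_c1(x, prior_c1=0.4, prior_c2=0.6):
    """Bayes' rule: p(C1|x) = p(x|C1) p(C1) / p(x)."""
    num = p_x_given_c1(x) * prior_c1
    den = num + p_x_given_c2(x) * prior_c2
    return num / den

result = posterior_c1(4 / 3)       # → 0.4
```

At $x = 4/3$ the two class likelihoods happen to coincide (both equal $4/3$), so the posterior reduces to the prior.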

2b. The Bayes classifier for this problem is given by: - a) $\text{Decision} = $ - b) $\text{Decision} = $ - c) $\text{Decision} = $ - d) $\text{Decision} = $

2c. Let the discrimination boundary be given by $x=a$: an observation is classified as $C_1$ if $x < a$ and as $C_2$ otherwise. Work out the probability of making a wrong classification decision. - a) $0.4\int_{a}^{2}(4-2x)dx + 0.6\int_{1}^{a}6(1-x)(x-2)dx$ - b) $0.6\int_{a}^{2}(4-2x)dx + 0.4\int_{1}^{a}6(1-x)(x-2)dx$ - c) $a/2$ - d) $\int_{a}^{2}(4-2x)dx + \int_{1}^{a}6(1-x)(x-2)dx$
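Under the decision rule implied by the integration limits in the options (decide $C_1$ for $x < a$, $C_2$ for $x \geq a$), the error probability adds the prior-weighted mass of each class that lands in the other class's region. A numerical-integration sketch, with the densities again taken from the option integrands:

```python
def p_x_given_c1(x):
    return 4 - 2 * x

def p_x_given_c2(x):
    return 6 * (1 - x) * (x - 2)

def riemann(f, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

def p_error(a, prior_c1=0.4, prior_c2=0.6):
    """Decide C1 for x < a, C2 for x >= a; sum the two error masses."""
    miss_c1 = prior_c1 * riemann(p_x_given_c1, a, 2.0)   # true C1, decided C2
    miss_c2 = prior_c2 * riemann(p_x_given_c2, 1.0, a)   # true C2, decided C1
    return miss_c1 + miss_c2

err = p_error(1.5)
```

The boundary value $a = 1.5$ here is an arbitrary illustration; the function works for any $a \in [1, 2]$.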


3. Coin Toss Prediction

Consider a coin with outcomes $x \in \{0, 1\}$, where $x=1$ denotes heads and $x=0$ denotes tails.

We assume that the data-generating process is governed by a Bernoulli distribution, $p(x=1|\mu) = \mu$, and we assume a Beta distribution for the prior on $\mu$: $p(\mu) = \mathrm{Beta}(\mu|\alpha=3, \beta=2)$.

We throw the coin 5 times and observe outcomes $D = \{0, 1, 1, 0, 1\}$. (Consult the Formula sheet in the preamble of the exam for Beta distribution rules).

3a. Work out the likelihood function $p(D|\mu)$ for $\mu$. - a) $p(D|\mu) = \mu^3(1-\mu)^2$ - b) $p(D|\mu) = \binom{5}{3} \cdot \mu^2(1-\mu)^3$ - c) $p(D|\mu) = \binom{5}{2} \cdot \mu^3(1-\mu)^2$ - d) $p(D|\mu) = \binom{3}{2} \cdot \mu^3(1-\mu)^2$

3b. Compute the posterior distribution $p(\mu|D)$. - a) $p(\mu|D) = \binom{5}{2} \cdot \mu^3(1-\mu)^2$ - b) $p(\mu|D) = Beta(\mu|6, 4)$ - c) $p(\mu|D) = Beta(\mu|5, 5)$ - d) $p(\mu|D) = \mu^3(1-\mu)^2 \cdot Beta(\mu|\alpha=3, \beta=2)$

3c. Compute the evidence $p(D)$. - a) $p(D) = \frac{\Gamma(4)\Gamma(6)}{\Gamma(10)}$ - b) $p(D) = \frac{\Gamma(4)\Gamma(5)\Gamma(6)}{\Gamma(2)\Gamma(3)\Gamma(10)}$ - c) $p(D) = \frac{\Gamma(5)}{\Gamma(2)\Gamma(3)}$ - d) $p(D) = \frac{\Gamma(5)\Gamma(10)}{\Gamma(2)\Gamma(3)\Gamma(4)\Gamma(6)}$

3d. Now compute the probability for throwing heads after the data set has been absorbed in the model. - a) $p(x_{n+1}=1|D) = Beta(0.6|6, 4)$ - b) $p(x_{n+1}=1|D) = 0.6$ - c) $p(x_{n+1}=1|D) = 0.7$ - d) $p(x_{n+1}=1|D) = 0.4$
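The Beta–Bernoulli conjugacy behind 3a–3d can be verified directly. The prior parameters $(\alpha=3, \beta=2)$ used below are the ones that appear in the answer options; the data contribute 3 heads and 2 tails:

```python
from math import gamma

def beta_fn(a, b):
    """Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

alpha, beta = 3, 2        # prior Beta(mu | 3, 2), as in the answer options
heads, tails = 3, 2       # counts in D = {0, 1, 1, 0, 1}

# Conjugacy: posterior is Beta(mu | alpha + heads, beta + tails).
post_a, post_b = alpha + heads, beta + tails   # Beta(6, 4)

# Evidence: p(D) = B(alpha + heads, beta + tails) / B(alpha, beta).
evidence = beta_fn(post_a, post_b) / beta_fn(alpha, beta)

# Posterior predictive: p(x_{n+1} = 1 | D) is the posterior mean of mu.
p_heads = post_a / (post_a + post_b)
```

Note that for a recorded sequence of outcomes (rather than just the counts), the likelihood carries no binomial coefficient: $p(D|\mu) = \mu^3(1-\mu)^2$.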


4. Miscellaneous Comprehension

4a. Consider a state-space model-based Active Inference agent that interacts with its world. Which of the following statements about the agent's computations is most consistent with Friston's FEP? - a) Perception minimizes the complexity of the states. - b) The agent infers actions by maximizing the free energy in future states. - c) The agent infers actions by maximizing the expected accuracy in future states. - d) The agent infers actions by minimizing the expected free energy in future states.

4b. Overfitting occurs when a model fits too closely to a training data set, resulting in poor generalization. Why is a "Bayesian engineer" usually not very concerned about overfitting? - a) Bayesian modeling aims to maximize (log-) model evidence, which decomposes as "training data fit minus model complexity". The complexity term prevents overfitting on the training data. - b) Bayesian modeling uses a separate test data set to check the generalization properties of the model. - c) Bayesian modeling uses probability theory to minimize the probability of overfitting as the training data set grows. - d) Bayesian modeling aims to maximize (log-) model evidence, which decomposes as "training data fit minus entropy of the model parameters". The entropy term prevents overfitting on the training data.
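The decomposition referred to in option (a) can be written as an exact identity (a standard result, stated here for reference): with posterior $p(\theta|D,m)$,

```latex
\log p(D|m)
  = \underbrace{\mathbb{E}_{p(\theta|D,m)}\!\left[\log p(D|\theta,m)\right]}_{\text{training data fit (accuracy)}}
  \;-\; \underbrace{\mathrm{KL}\!\left[\,p(\theta|D,m)\,\big\|\,p(\theta|m)\,\right]}_{\text{model complexity}}
```

The complexity term penalizes posteriors that move far from the prior, which is what discourages overfitting when model evidence is the optimization target.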

4c. Consider a data set $\{x_n \mid n=1,2,\ldots,N\}$ with $x_n \in \mathbb{R}^M$ and a set of latent one-hot coded variables $z_n = (z_{n1}, z_{n2}, \ldots, z_{nK})$, i.e., $z_{nk} \in \{0, 1\}$ and $\sum_{k=1}^{K} z_{nk} = 1$. Which of the following is a correct specification for a Gaussian Mixture Model? - a) $p(x_n, z_n) = \prod_{k=1}^{K} \pi_k \cdot \mathcal{N}(x_n|\mu_k, \Sigma_k)$ - b) $p(x_n, z_n) = \prod_{k=1}^{K} (\pi_k \cdot \mathcal{N}(x_n|\mu_k, \Sigma_k))^{z_{nk}}$ - c) $p(x_n, z_n) = \prod_{k=1}^{K} \pi_k \cdot \mathcal{N}(x_n|\mu_k, \Sigma_k)^{z_{nk}}$ - d) $p(x_n, z_n) = \prod_{k=1}^{K} (\pi_k \cdot \mathcal{N}(x_n|\mu_k, \Sigma_k))^{z_n}$
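A quick way to see why a one-hot exponent works: raising each factor to $z_{nk}$ switches off all but the active component, so summing the joint over the $K$ one-hot configurations of $z_n$ recovers the familiar mixture density $\sum_k \pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k)$. A sketch with hypothetical 1-D components:

```python
from math import sqrt, pi, exp

def gaussian_pdf(x, mean, var):
    """Density of N(x | mean, var)."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

# Hypothetical 1-D mixture with K = 3 components.
pis = [0.5, 0.3, 0.2]
mus = [-1.0, 0.0, 2.0]
vars_ = [1.0, 0.5, 2.0]

def joint(x, z):
    """p(x, z) = prod_k (pi_k * N(x|mu_k, var_k)) ** z_k for one-hot z."""
    result = 1.0
    for k in range(3):
        result *= (pis[k] * gaussian_pdf(x, mus[k], vars_[k])) ** z[k]
    return result

def marginal(x):
    """Sum the joint over the K one-hot configurations of z."""
    one_hots = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
    return sum(joint(x, z) for z in one_hots)

mix = marginal(0.7)
```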

4d. A dark bag contains five red balls and seven green ones. Balls are not returned to the bag after each draw. If you know that on the last draw the ball was a green one, what is the probability of drawing a red ball on the first draw? - a) $4/11$ - b) $5/11$ - c) $5/12$ - d) $6/11$
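Because draws without replacement are exchangeable, conditioning on the last draw is elementary. A sketch that computes the conditional probability exactly and also checks it by brute-force enumeration of the red-ball positions (positions suffice, since balls of one colour are interchangeable):

```python
from fractions import Fraction
from itertools import combinations

# 5 red and 7 green balls, drawn without replacement.
# P(first red AND last green) = P(first red) * P(last green | first red).
p_joint = Fraction(5, 12) * Fraction(7, 11)
# By exchangeability, P(last green) = P(any single draw is green) = 7/12.
p_last_green = Fraction(7, 12)
p_first_red_given_last_green = p_joint / p_last_green

# Brute-force check: place the 5 red balls on 5 of the 12 draw positions.
total = favorable = 0
for reds in combinations(range(12), 5):
    if 11 not in reds:        # condition: last draw is green
        total += 1
        if 0 in reds:         # event: first draw is red
            favorable += 1
```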

4e. Consider a (so-called Factor Analysis) model with specification $x_n = \Lambda z_n + v_n$, where $z_n \sim \mathcal{N}(0, I)$ and $v_n \sim \mathcal{N}(0, \Psi)$. Furthermore, we assume that $\mathbb{E}[z_n v_n^T] = 0$. Evaluate $p(x_n)$. - a) $p(x_n) \sim \mathcal{N}(0, \Lambda\Lambda^T + \Psi)$ - b) $p(x_n) \sim \mathcal{N}(0, \Lambda\Lambda^T + \Psi^T)$ - c) $p(x_n) \sim \mathcal{N}(1, \Lambda + \Psi)$ - d) $p(x_n) \sim \mathcal{N}(0, \Lambda + \Psi)$
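The marginal covariance of a linear-Gaussian model of this form can be checked by simulation. The specific $\Lambda$ and $\Psi$ below are hypothetical, and the check relies only on the latent and noise terms being independent zero-mean Gaussians:

```python
import random

random.seed(0)

# Hypothetical 2-D observation model: x = Lambda * z + v, with scalar
# latent z ~ N(0, 1) and independent noise v ~ N(0, Psi), Psi diagonal.
LAMBDA = [1.0, 0.5]
PSI = [0.1, 0.2]

N = 100_000
samples = []
for _ in range(N):
    z = random.gauss(0.0, 1.0)
    x = [LAMBDA[i] * z + random.gauss(0.0, PSI[i] ** 0.5) for i in range(2)]
    samples.append(x)

def cov(i, j):
    """Empirical covariance of components i and j of the samples."""
    mi = sum(s[i] for s in samples) / N
    mj = sum(s[j] for s in samples) / N
    return sum((s[i] - mi) * (s[j] - mj) for s in samples) / N

# Theory: Cov(x) = Lambda Lambda^T + Psi.
theory = [[LAMBDA[i] * LAMBDA[j] + (PSI[i] if i == j else 0.0)
           for j in range(2)] for i in range(2)]
```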

4f. Why can Variational Free Energy (VFE) minimization be interpreted as an approximation to Bayesian inference? - a) VFE minimization is a model for Bayesian inference plus a little bit of Gaussian noise. - b) VFE minimization minimizes the KL-divergence between the variational distribution and Bayesian evidence. Furthermore, the VFE itself is an upper bound on the Bayesian posterior distribution. Therefore, VFE minimization identifies approximations to both the posterior over latent variables and model evidence. - c) VFE minimization minimizes Bayesian evidence by optimizing the variational posterior. Therefore, VFE minimization identifies approximations to both the posterior over latent variables and model evidence. - d) VFE minimization minimizes the KL-divergence between the variational and Bayesian posterior distributions. Furthermore, the VFE itself is an upper bound to (negative log-) evidence. Therefore, VFE minimization identifies approximations to both the posterior over latent variables and model evidence.
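The identity behind option (d), stated for reference (with variational posterior $q(z)$ and generative model $p(x, z)$):

```latex
F[q] \;\triangleq\; \mathbb{E}_{q(z)}\!\left[\log \frac{q(z)}{p(x,z)}\right]
   \;=\; \mathrm{KL}\!\left[\,q(z)\,\big\|\,p(z|x)\,\right] \;-\; \log p(x)
   \;\ge\; -\log p(x)
```

Minimizing $F[q]$ therefore drives $q(z)$ toward the Bayesian posterior $p(z|x)$, and the minimized VFE itself upper-bounds the negative log-evidence.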